1 Load the data

The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), ā€˜patterned’ precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.

Schematic depicting the iPSC differentiation strategy for motor neurogenesis. Arrows indicate sampling time-points in days when cells were fractionated into nuclear and cytoplasmic compartments prior to deep (polyA) RNA-sequencing. Four iPSC clones were obtained from four different healthy controls and three iPSC clones from two ALS patients with VCP mutations: R155C and R191Q; hereafter termed VCPmu. Induced-pluripotent stem cells (iPSC); neural precursors (NPC); ā€œpatternedā€ precursor motor neurons (ventral spinal cord; pMN); post-mitotic but electrophysiologically inactive motor neurons (MN); electrophysiologically active MNs (mMN). The gene expression count data was obtained from Kallisto (Bray et al., 2016) using the Gencode hg38 release Homo sapiens transcriptome.

load("./data_09_04_2024.RData")
#Data: 
# myE_ge                  : raw gene expression count matrix 
# info                    : sample annotation (data-frame)
# ttg                     : rows (genes) annotation

# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge      <- myE_ge[,sel_samples]
info        <- info[sel_samples,]
info$group  <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))


#Make some nice colors to facilitate the visualisation of time-points
mytime                 <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days            <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days)     <- c(0,3,7,14,22,35)
mycols                 <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))

2 Get familiar with your data

How many samples is there in your count data?

Does this correspond to your experimental protocol?

Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file.

What are the covariates?

how many rows?

Check that the rowID of the data-count corresponds to the ensembl gene ID in your gene annotation file.

#View(info)
#nrow(info)
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
match(colnames(myE_ge),info$sampleID)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
sum(is.na(match(rownames(myE_ge),ttg$ens_gene)))
## [1] 0

How does your data look-like?

par(mfrow=c(2,3))
plot(density(myE_ge[,1]),main="Read count distribution in sample 1")
boxplot(myE_ge,las=1,ylab="raw gene count",main="with outliers")
boxplot(myE_ge,outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
plot(density(t(myE_ge[1,])),main="Read count distribution across samples")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),las=1,ylab="raw gene count",main="with outliers")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),outline=FALSE,las=1,ylab="raw gene count",main="without outliers")

3 Pre-processing of the count data

3.1 Variance stabilisation with log-transformation

Analysis of the variance of the gene count across the \(N\) samples: \(\sigma^2=Var(X)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\) and \(\mu=\frac{1}{n}\sum_{i=1}^n\)x_i:

Let’s have a close look at how the variance in gene expression scale with average and then correct it with some transformation.

#calculate mean and variance of the rows
row_avg<-apply(myE_ge,1,mean)
row_var <-apply(myE_ge,1,var)

#Log-transform your count data
myE_gel <- log2(1+myE_ge)

#DESeq2 variance stabilisation
vsd    <- DESeq2::varianceStabilizingTransformation(matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE))

par(mfrow=c(1,3))
#Variance scale with average --> Poisson distribution
plot(row_avg,row_var,las=1,main="RAW data",pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="mean read count",ylab="variance read count",xlim=c(0,5000),ylim=c(0,10^7))
grid()

plot(apply(myE_gel,1,mean),apply(myE_gel,1,var),las=1,main="Log2-transformed data",pch=19,col=rgb(0,0,0,0.1),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()

plot(apply(vsd,1,mean),apply(vsd,1,var),las=1,main="DESeq2 variance stabilised",pch=19,col=rgb(0,0,0,0.05),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()

3.2 Identification of reliably expressed genes

Let’s first have a look at the distribution of gene expression for a few samples:

par(mfrow=c(1,3))
geneplotter::multidensity(myE_gel[,c(1,2,4,10)],main="Read count distributions",las=1,xlab="read count [log2]")
plot(density(myE_gel[,1]),las=1,main=colnames(myE_gel)[1],las=1,xlab="read count [log2]")
plot(density(myE_gel[,3]),las=1,main=colnames(myE_gel)[3],las=1,xlab="read count [log2]")

We can identify reliably expressed genes by fitting a bimodal distribution to the log2-read count distribtution of each samples to discriminate between the foreground and the background transcription. The limit must be fitted to each individual sample given that the library size will impact this factor.

We can now select for the reliably expressed genes in each sample:

3.3 Normalisation

Let’s first look at the ditribution of the gene count across a few samples:

There are several options for normalisation. Scaling factors, quantile normalisation etc.

What is the effect on the gene count?

4 Unsupervised hierarchical clustering analysis of the samples

Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.

Let’s first compare the clustering of the samples using the Manhattan distance and Ward D algorithm:

Samples are very similarly clustered in Quantile and Deseq2 normalised methods. Let’s compare the different effect of the distances used as well as agglomerative alorithm. From now we will use the quantile normalised matrix:

5 Differential gene expression analysis

There are many useful R packages to perform differential gene expression analysis of bulk RNA-sequencing data usge as Sleuth, edgeR and DESeq2.

5.1 Some useful functions for DEG analysis

#Volcano plot
VolcanoPlot <- function(myDGE=res, col_log2FC=res$log2FoldChange, col_pval=res$padj, timepoint=3){
  mycol_genes                                             <- rep(rgb(0.7,0.7,0.7,0.2))
  mycol_genes[abs(col_log2FC)>=1.0&col_pval<=0.05]<- rep(rgb(0.0,0.0,0.0,0.2))
  plot(col_log2FC,-log10(col_pval),pch=19,cex=0.3,col=mycol_genes,xlab="log2FC",ylab="-log10(P-value)", main=paste("DIV_", timepoint, "vs_0"), frame=FALSE)
  abline(h=-log10(0.01),lty=2,col="grey")
  abline(v=c(-1,1),lty=2,col="grey")
}

#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
  gostres_diff <- gost(query = genes_list, 
                  organism = "hsapiens", ordered_query = FALSE, 
                  multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                  measure_underrepresentation = FALSE, evcodes = TRUE, 
                  user_threshold = 0.05, correction_method = "g_SCS", 
                  domain_scope = "annotated", custom_bg = NULL, 
                  numeric_ns = "", as_short_link = FALSE, sources=c("GO:BP", "GO:MF","KEGG"))
  gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}

#Create a venn diagram from two gene lists 
## This is an example - not meant to run
vennDiag <- function(genes_lists){
  genes_comparisons <- do.call(what=cbind,args=lapply(genes_lists,function(Z)return(rownames(myE_gen)%in%Z)))
  colnames(genes_comparisons)<-c("cond1","cond2")
  vennDiagram(genes_comparisons,main="blasdfas")
}

5.2 DESeq2

Here is a tutorial how to run DESeq2 on this data:

mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV              <- factor(info$DIV,levels=c("0","3","7","14","22","35"))

#How to fit DEseq2:
dds                   <- DESeq2::mDESeqDataSetFromMatrix(countData = mycountData[is_expressed_global,],
                              colData = info,
                              design= ~ DIV + Fraction)
dds                   <- DESeq2::DESeq(dds)
mycoeff               <- resultsNames(dds) # lists the coefficients

#How to extract a result (for a specific comparison: if comparison corresponds to mycoeff index = "DIV_14_vs_0":
res                   <- results(dds, name="DIV_14_vs_0")
genes_up              <- as.character(rownames(res))[res$log2FoldChange>1.0&res$padj<0.01]
genes_do              <- as.character(rownames(res))[res$log2FoldChange<(-1.0)&res$padj<0.01]

5.3 EdgeR - Quasi-likelihood F-tests

Here is a tutorial how to run EdgeR with Quasi-likelihood F-tests on this data:

mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)

info$DIV              <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y                     <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
design                <- model.matrix(~info$group)
y                     <- estimateDisp(y,design)

#To perform quasi-likelihood F-tests:
fit <- glmQLFit(y,design)
mycoefs <- colnames(fit$design)

#The fit has 12 parameters. The first is the baseline level of group 1 (Nuclear_0). The remaining are the difference between Nuclear_0 and the other groups. 
#To compare Nuclear_3 vs Nuclear_0:
qlf.nuc3 <- glmQLFTest(fit, coef=2)
res<- topTags(qlf.nuc3,n=sum(is_expressed_global))

#To compare Nuclear_7 vs Nuclear_3
qlf.nuc3.nuc7 <- glmQLFTest(fit, contrast=c(0,-1,1,rep(0,9)))
res<- topTags(qlf.nuc3.nuc7,n=sum(is_expressed_global))

res                   <- topTags(glmQLFTest(fit, coef=3),n=sum(is_expressed_global))$table
genes_up              <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
genes_do              <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]

5.4 EdgeR - Likelihood ratio tests

Here is a tutorial how to run EdgeR with Likelihood ratio tests on this data:

mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)

info$DIV              <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y                     <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
design                <- model.matrix(~info$group)
y                     <- estimateDisp(y,design)


#To perform likelihood ratio tests:
fit     <- glmFit(y,design)
mycoefs <- colnames(fit$design)
#The fit has 12 parameters. The first is the baseline level of group 1 (Nuclear_0). The remaining are the difference between Nuclear_0 and the other groups. 
#To compare Nuclear_3 vs Nuclear_0:
lrt.nuc3 <- glmLRT(fit,coef=2)
res<- topTags(lrt.nuc3,n=sum(is_expressed_global))$table
genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]

6 Graded homework

Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.

6.1 Task 1 (1 pt)

In this task, you will perform a differential gene expression using Deseq2.

  1. For each time point, compute the number of differentially expressed genes (up and down). Show these numbers in barplots: you should obtain a barplot with 5 bars (one for each time point) for the upregulated genes, and similarly one barplot for the downregulated genes. Comment on the evolution of these numbers.

Since, we have comparison between 0 time point and every other time point, which are 3, 7, 14, 22 and 35, we expect 2 different plots, one for upregulated genes and the other one with downregulated genes, each with 5 bars in the box plot. For obtaining the sets of upregulated and downregulated genes for each time point we used DESeq2 package. When selecting the design formula, we opted for the DIV and Fraction columns. Since the design formula tells the statistical software for which sources of variation to test for, besides taking your factors of interest (which is DIV in this case), it is also important to include any additional covariates that are sources of variation (therefore addition of the Fraction).

mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
info$DIV              <- factor(info$DIV,levels=c("0","3","7","14","22","35"))

#How to fit DEseq2:
dds                   <- DESeq2::DESeqDataSetFromMatrix(countData = mycountData[is_expressed_global,],
                              colData = info,
                              design= ~ Fraction + DIV)
dds                   <- DESeq2::DESeq(dds)
mycoeff               <- resultsNames(dds) # lists the coefficients

#How to extract a result (for a specific comparison: if comparison corresponds to mycoeff index = "DIV_14_vs_0":

#Extracting the result that correspond to index = "DIV_3_vs_0"
res_3                   <- results(dds, name="DIV_3_vs_0")
genes_up_3              <- length(as.character(rownames(res_3))[res_3$log2FoldChange>1.0&res_3$padj<0.01])
genes_do_3              <- length(as.character(rownames(res_3))[res_3$log2FoldChange<(-1.0)&res_3$padj<0.01])

#Extracting the result that correspond to index = "DIV_7_vs_0"
res_7                   <- results(dds, name="DIV_7_vs_0")
genes_up_7              <- length(as.character(rownames(res_7))[res_7$log2FoldChange>1.0&res_7$padj<0.01])
genes_do_7              <- length(as.character(rownames(res_7))[res_7$log2FoldChange<(-1.0)&res_7$padj<0.01])

#Extracting the result that correspond to index = "DIV_14_vs_0"
res_14                   <- results(dds, name="DIV_14_vs_0")
genes_up_14              <- length(as.character(rownames(res_14))[res_14$log2FoldChange>1.0&res_14$padj<0.01])
genes_do_14              <- length(as.character(rownames(res_14))[res_14$log2FoldChange<(-1.0)&res_14$padj<0.01])

#Extracting the result that correspond to index = "DIV_22_vs_0"
res_22                   <- results(dds, name="DIV_22_vs_0")
genes_up_22              <- length(as.character(rownames(res_22))[res_22$log2FoldChange>1.0&res_22$padj<0.01])
genes_do_22              <- length(as.character(rownames(res_22))[res_22$log2FoldChange<(-1.0)&res_22$padj<0.01])

#Extracting the result that correspond to index = "DIV_35_vs_0"
res_35                   <- results(dds, name="DIV_35_vs_0")
genes_up_35              <- length(as.character(rownames(res_35))[res_35$log2FoldChange>1.0&res_35$padj<0.01])
genes_do_35              <- length(as.character(rownames(res_35))[res_35$log2FoldChange<(-1.0)&res_35$padj<0.01])

upregulated <- c(genes_up_3, genes_up_7, genes_up_14, genes_up_22, genes_up_35)
downregulated <- c(genes_do_3, genes_do_7, genes_do_14, genes_do_22, genes_do_35)
identification <- c("3_vs_0", "7_vs_0", "14_vs_0", "22_vs_0", "35_vs_0")

As we can observe from the plots above, numbers of both upregulated and downregulated genes increase over time. These changes in numbers are expected since we are comparing the numbers with the respect to the 0 time point, which corresponds to the undifferentiated state of the cell (stem cell state). While it progresses towards the differentiated cell state, many genes gain activity with the transition to the new cell purpose, but also many lose their activity, thereby such a plot. It is interesting to note that the biggest difference in numbers for both upregulated and downregulated genes is achieved between time point 14 and time point 22.

  1. For each time point, display the volcano plot showing the differentially expressed genes.

With volcano plot besides differentially expressed genes, we can also observe their statistical significance (p-value) versus magnitude of change (fold change). The most upregulated genes are towards the right, the most downregulated genes are towards the left, and most statistically significant genes are towards the top. It is important to note that \(-log_{10}(P_{value})\) increase with the time, meaning that genes in the last plot (time point 35 compared to the time point 0) have higher values compared to the other.

  1. For the time point D0–>D3, show the biological pathways associated with the upregulated and downregulated genes. Comment.

The plot we obtained displays enrichment of gene ontology (GO) terms. The x-axis represents functional terms that are grouped and color-coded. From GO library: molecular function (MF), biological process (BP), cellular components (CC) and KEGG pathways. The y-axis shows the adjusted enrichment p values in āˆ’log10 (Padj). For upregulated genes, top GO terms associated with MF, BP and KEGG are:

If we take a closer look into the top GO terms associated with the upregulated genes, we can come to the conclusion that the vast majority of them is associated with the system development which is expected since the cells on which we are performing RNA-seq and differential gene expression analysis are differentiating from human induced pluripotent stem cells towards the electrophysiologicaly active motor neurons. Also, we observe upregulation of pathway associated with calcium ion binding which is important for functioning of motor neurons, as well as wnt-protein binding which is secreted growth factor involved in signaling.

When it comes to downregulated genes, top GO terms associated with MF, BP and KEGG are:

genes_up_3_names = as.character(rownames(res_3))[res_3$log2FoldChange>1.0&res_3$padj<0.01]
genes_do_3_names = as.character(rownames(res_3))[res_3$log2FoldChange<(-1.0)&res_3$padj<0.01]

par(mfrow=c(1,2))
GO_analysis(genes_up_3_names)
GO_analysis(genes_do_3_names)

6.2 Task 2 - Bonus (0.25 pt)

Do the same as task 1 but using EdgeR – Likelihood ratio tests. Comment on the differences between the techniques.

mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)

# I'm not sure about this, but I think we should change group to DIV.
info$DIV              <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
y                     <- DGEList(counts=mycountData[is_expressed_global,],group=info$DIV)
design                <- model.matrix(~info$Fraction + info$DIV)
y                     <- estimateDisp(y,design)


#To perform likelihood ratio tests:
fit     <- glmFit(y,design)
mycoefs <- colnames(fit$design)

#We have 6 coefficients. Baseline level of DIV_0 is coeff 1. The remaining are the difference between DIV_0 and the other groups. 

#To compare DIV_3 vs DIV_0:
lrt.div3 <- glmLRT(fit,coef=3)
res_3R <- topTags(lrt.div3,n=sum(is_expressed_global))$table
genes_up_3R <- length(as.character(rownames(res_3R))[res_3R$logFC>1.0&res_3R$FDR<0.01])
genes_do_3R <- length(as.character(rownames(res_3R))[res_3R$logFC<(-1.0)&res_3R$FDR<0.01])

#To compare DIV_7 vs DIV_0:
lrt.div7 <- glmLRT(fit,coef=4)
res_7R <- topTags(lrt.div7,n=sum(is_expressed_global))$table
genes_up_7R <- length(as.character(rownames(res_7R))[res_7R$logFC>1.0&res_7R$FDR<0.01])
genes_do_7R <- length(as.character(rownames(res_7R))[res_7R$logFC<(-1.0)&res_7R$FDR<0.01])

#To compare DIV_14 vs DIV_0:
lrt.div14 <- glmLRT(fit,coef=5)
res_14R <- topTags(lrt.div14,n=sum(is_expressed_global))$table
genes_up_14R <- length(as.character(rownames(res_14R))[res_14R$logFC>1.0&res_14R$FDR<0.01])
genes_do_14R <- length(as.character(rownames(res_14R))[res_14R$logFC<(-1.0)&res_14R$FDR<0.01])

#To compare DIV_22 vs DIV_0:
lrt.div22 <- glmLRT(fit,coef=6)
res_22R <- topTags(lrt.div22,n=sum(is_expressed_global))$table
genes_up_22R <- length(as.character(rownames(res_22R))[res_22R$logFC>1.0&res_22R$FDR<0.01])
genes_do_22R <- length(as.character(rownames(res_22R))[res_22R$logFC<(-1.0)&res_22R$FDR<0.01])

#To compare DIV_35 vs DIV_0:
lrt.div35 <- glmLRT(fit,coef=7)
res_35R <- topTags(lrt.div35,n=sum(is_expressed_global))$table
genes_up_35R <- length(as.character(rownames(res_35R))[res_35R$logFC>1.0&res_35R$FDR<0.01])
genes_do_35R <- length(as.character(rownames(res_35R))[res_35R$logFC<(-1.0)&res_35R$FDR<0.01])

upregulatedR <- c(genes_up_3R, genes_up_7R, genes_up_14R, genes_up_22R, genes_up_35R)
downregulatedR <- c(genes_do_3R, genes_do_7R, genes_do_14R, genes_do_22R, genes_do_35R)
identificationR <- c("3_vs_0", "7_vs_0", "14_vs_0", "22_vs_0", "35_vs_0")

With the edgeR tehnique, we can observe similar results to the ones achieved with the DESeq2 method. There is still the largest difference in numbers of upregulated and downregulated genes between time point 22 and time point 14. On the other hand, we can also notice slight decrease in numbers of upregulated genes in edgeR tehnique, as well as slight increase in numbers of downregulated genes.

Regarding the Volcano plot, we again observe similar results as in the previous case, with slight difference being that we are plotting \(-log_{10}FDR\) instead of adjusted p-value.

genes_up_3R_names = as.character(rownames(res_3R))[res_3R$logFC>1.0&res_3R$FDR<0.01]
genes_do_3R_names = as.character(rownames(res_3R))[res_3R$logFC<(-1.0)&res_3R$FDR<0.01]

par(mfrow=c(1,2))
# Upregulated genes
GO_analysis(genes_up_3R_names)
# downregulated genes
GO_analysis(genes_do_3R_names)

GO enrichment analysis results in the same pathways being enriched with both edgeR and DESeq2 methods.